## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
## 
##     mean
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     stat
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
## 
## Attaching package: 'ggthemes'
## The following object is masked from 'package:mosaic':
## 
##     theme_map
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
## 
## Attaching package: 'skimr'
## The following object is masked from 'package:mosaic':
## 
##     n_missing
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:mosaic':
## 
##     logit, rescale
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

1 Dataset creation

We created our own dataset from the IMDb database (https://datasets.imdbws.com). The dataset we import below is the result of several rounds of filtering and merging. We performed these steps in Python, relying on the vectorized operations of pandas (built on NumPy) to handle the tens of millions of rows, and then exported the resulting enriched dataset. We used the following datasets:

1.0.1 title.basics.tsv

Contains the following information for titles:

  • tconst (string) - alphanumeric unique identifier of the title
  • titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
  • primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
  • originalTitle (string) - original title, in the original language
  • isAdult (boolean) - 0: non-adult title; 1: adult title
  • startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
  • endYear (YYYY) – TV Series end year. ‘’ for all other title types
  • runtimeMinutes – primary runtime of the title, in minutes
  • genres (string array) – includes up to three genres associated with the title

1.0.2 title.principals.tsv

Contains the principal cast/crew for titles

  • tconst (string) - alphanumeric unique identifier of the title
  • ordering (integer) – a number to uniquely identify rows for a given titleId
  • nconst (string) - alphanumeric unique identifier of the name/person
  • category (string) - the category of job that person was in
  • job (string) - the specific job title if applicable, else ‘’
  • characters (string) - the name of the character played if applicable, else ‘’

1.0.3 title.ratings.tsv

Contains the IMDb rating and votes information for titles

  • tconst (string) - alphanumeric unique identifier of the title
  • averageRating – weighted average of all the individual user ratings
  • numVotes - number of votes the title has received

1.0.4 name.basics.tsv

Contains the following information for names

  • nconst (string) - alphanumeric unique identifier of the name/person
  • primaryName (string)– name by which the person is most often credited
  • birthYear – in YYYY format
  • deathYear – in YYYY format if applicable, else ‘’
  • primaryProfession (array of strings)– the top-3 professions of the person
  • knownForTitles (array of tconsts) – titles the person is known for

After filtering the rows and dropping the columns that are not relevant to our analysis, which focuses on the variables that could explain the average rating of a title, we end up with the following dataset (significantly smaller than the merged datasets). We would like to use the following variables:

Data summary
Name Imdb
Number of rows 850890
Number of columns 17
_______________________
Column type frequency:
character 13
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
tconst 0 1 9 10 0 309270 0
titleType 0 1 5 12 0 10 0
primaryTitle 0 1 1 218 0 256254 0
startYear 0 1 2 4 0 130 0
endYear 0 1 2 4 0 75 0
runtimeMinutes 0 1 1 4 0 551 0
genres 0 1 2 31 0 1387 0
seasonNumber 0 1 0 4 493710 66 0
episodeNumber 0 1 0 5 493710 2387 0
category 0 1 0 7 1 3 0
primaryName 0 1 0 47 1 58364 0
birthYear 0 1 0 4 1 157 0
deathYear 0 1 0 4 1 119 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
X 0 1 425444.50 245630.93 0 212722.2 425444.5 638166.8 850889 ▇▇▇▇▇
isAdult 0 1 0.01 0.11 0 0.0 0.0 0.0 1 ▇▁▁▁▁
averageRating 0 1 6.76 1.22 1 6.0 6.9 7.6 10 ▁▁▅▇▁
numVotes 1 1 1156.08 17799.11 5 11.0 31.0 112.0 2331548 ▇▁▁▁▁

2 Data cleaning and processing

2.1 General transformations

We first need to transform a few columns. We decided to drop every row that has a missing value in one of the variables we want to analyse. At first sight, after running the skim function, there appear to be no missing values in our dataset. In fact, some columns that hold obviously quantitative variables, such as runtimeMinutes, are currently text columns, and their missing entries are stored as a placeholder string (the IMDb files use “\N”, which is still visible in the genres column further down), which is why R does not identify them as missing (NA). We first need to replace these placeholders with NA, and then we can convert the columns to the numeric class. Using the same process, we also drop the rows for which we don’t have the year (startYear especially, and the actor’s birthYear).
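The conversion described above can be sketched in base R as follows. This is only an illustration on a toy data frame: the object name `imdb`, the list of columns and the exact placeholder strings (`""` and `"\\N"`) are assumptions, not the code actually used for this report.

```r
# Toy stand-in for the imported data (hypothetical values).
# "\\N" is the assumed missing-value placeholder of the IMDb files.
imdb <- data.frame(
  tconst         = c("tt0000001", "tt0000002", "tt0000003"),
  runtimeMinutes = c("70", "\\N", "45"),
  startYear      = c("1906", "1910", "\\N"),
  stringsAsFactors = FALSE
)

text_cols <- c("runtimeMinutes", "startYear")

for (col in text_cols) {
  values <- imdb[[col]]
  # Turn the placeholders into real missing values ...
  values[values %in% c("", "\\N")] <- NA
  # ... so that the conversion to numeric is clean.
  imdb[[col]] <- as.numeric(values)
}

# Keep only the rows that are complete for the variables of interest.
imdb_complete <- imdb[complete.cases(imdb[, text_cols]), ]
```

Replacing the placeholders before calling `as.numeric()` avoids the coercion warnings that R would otherwise emit for every non-numeric string.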

We also decided to focus our analysis on the actors and actresses, rather than on the whole cast and crew (directors, producers and writers). The main reason for this choice is to reduce the size of the dataframe, which is above 4 million rows: computations would take a long time, and the memory of our computer cannot handle this amount of data for long.

We use the year columns converted above to calculate the age of the actor when the title was released (for movies, the release year; for titles such as tvSeries, the year in which the first episode was released). The resulting variable is age_at_start. We also need to turn the categorical variables into dummy variables; isAdult and Actor_gender are the only qualitative columns we identified.

To further reduce the size of our dataframe and make it better suited to inference, we decided to narrow the title types down to movies and tvSeries only (it might be appropriate to separate these two title types later and analyze them separately).
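As a rough sketch of these transformations, assuming the category column takes the values "actor" and "actress" and using a hypothetical toy data frame named `cast`:

```r
# Toy cast table (hypothetical rows); one row per credited person and title.
cast <- data.frame(
  tconst    = c("tt1", "tt1", "tt2", "tt3"),
  titleType = c("movie", "movie", "tvSeries", "short"),
  category  = c("actor", "actress", "director", "actor"),
  startYear = c(1950, 1950, 1999, 2005),
  birthYear = c(1910, 1925, 1960, 1980),
  stringsAsFactors = FALSE
)

# Keep the acting rows only, then the two title types of interest.
cast <- cast[cast$category %in% c("actor", "actress"), ]
cast <- cast[cast$titleType %in% c("movie", "tvSeries"), ]

# Age of the person in the year the title started.
cast$age_at_start <- cast$startYear - cast$birthYear

# Dummy coding used in the report: 0 for an actor, 1 for an actress.
cast$Actor_gender <- factor(ifelse(cast$category == "actress", 1, 0))
```

Filtering before computing the new columns keeps the intermediate objects small, which matters for a table of several million rows.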

2.2 Gender distribution

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   Actor_gender avg_age
##   <fct>          <dbl>
## 1 0               44.2
## 2 1               35.9
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 56 rows containing non-finite values (stat_boxplot).

## Prepare dataset for linear regression analysis

Titles are duplicated in our dataset: a movie or tvSeries appears as many times as it has actors/actresses, because each row corresponds not to a title but to one actor or actress cast in that title. A row does not identify a person on its own (the same person can appear in several titles), but since an actor cannot be hired twice on the same movie/tvSeries, he or she appears at most once per title.

We need to summarize our dataset so that each row corresponds to a title. To do that, we ultimately need to summarize the actor/actress-related columns, which are:

  • Actor_gender
  • primaryName
  • birthYear
  • deathYear
  • age_at_start (the age of the actor when the movie was produced or when the tvSeries started to be produced)

We will keep Actor_gender, which is currently a categorical variable (0 for an actor, 1 for an actress), and age_at_start. Since each row is either 0 or 1, we will compute a ratio of actors to actresses per title (and check the effect of this ratio on the title’s rating). For age_at_start, we will simply compute the average age of the cast, regardless of gender (and check its effect on ratings).

To construct our final dataframe, we will create a series of dataframes and merge them (by joining the data) on the tconst variable, which is the unique ID of each title. Since we are calculating a ratio, we exclude the titles whose cast consists exclusively of actors or exclusively of actresses, because the ratio would then be 0 or would require dividing by 0.
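The per-title aggregation can be illustrated with the base R sketch below. The toy tables `cast` and `titles` and the intermediate column names `n_actors` and `n_actresses` are hypothetical; only tconst, Actor_gender, age_at_start, Avg_age and ratio_actor_actress come from the report.

```r
# Toy cast table: one row per actor/actress and title (hypothetical values).
cast <- data.frame(
  tconst       = c("tt1", "tt1", "tt1", "tt2", "tt2", "tt3"),
  Actor_gender = c(0, 0, 1, 0, 1, 0),   # 0 = actor, 1 = actress
  age_at_start = c(40, 30, 26, 50, 30, 45),
  stringsAsFactors = FALSE
)

# One value per title for each summary.
n_actors    <- tapply(cast$Actor_gender == 0, cast$tconst, sum)
n_actresses <- tapply(cast$Actor_gender == 1, cast$tconst, sum)
avg_age     <- tapply(cast$age_at_start, cast$tconst, mean)

per_title <- data.frame(
  tconst      = names(avg_age),
  Avg_age     = as.vector(avg_age),
  n_actors    = as.vector(n_actors),
  n_actresses = as.vector(n_actresses),
  stringsAsFactors = FALSE
)

# Drop single-gender casts before dividing, so the ratio is always finite
# and non-zero.
per_title <- per_title[per_title$n_actors > 0 & per_title$n_actresses > 0, ]
per_title$ratio_actor_actress <- per_title$n_actors / per_title$n_actresses

# Join the cast summaries back onto the title-level columns via tconst.
titles <- data.frame(
  tconst        = c("tt1", "tt2", "tt3"),
  averageRating = c(6.1, 7.3, 5.0),
  stringsAsFactors = FALSE
)
final_df <- merge(
  titles,
  per_title[, c("tconst", "Avg_age", "ratio_actor_actress")],
  by = "tconst"
)
```

Because `merge()` performs an inner join by default, the titles excluded from `per_title` also disappear from `final_df`.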

## `summarise()` regrouping output by 'tconst' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
##              tconst           titleType        primaryTitle             isAdult 
##         "character"         "character"         "character"            "factor" 
##           startYear      runtimeMinutes              genres       averageRating 
##           "numeric"           "numeric"         "character"           "numeric" 
##            numVotes             Avg_age ratio_actor_actress 
##           "numeric"           "numeric"           "numeric"
## # A tibble: 113 x 2
## # Groups:   startYear [113]
##    startYear     n
##        <dbl> <int>
##  1      1906     1
##  2      1909     1
##  3      1910     1
##  4      1911     9
##  5      1912    12
##  6      1913    33
##  7      1914    62
##  8      1915    86
##  9      1916   139
## 10      1917   134
## # … with 103 more rows
## # A tibble: 6 x 11
## # Groups:   tconst [6]
##   tconst titleType primaryTitle isAdult startYear runtimeMinutes genres
##   <chr>  <chr>     <chr>        <fct>       <dbl>          <dbl> <chr> 
## 1 tt000… movie     The Story o… 0            1906             70 "Biog…
## 2 tt000… movie     The White S… 0            1910             45 "Dram…
## 3 tt000… movie     The Life of… 0            1909             50 "Biog…
## 4 tt000… movie     The Battle … 0            1911             51 "War" 
## 5 tt000… movie     In the Prim… 0            1911             45 "\\N" 
## 6 tt000… movie     Der fremde … 0            1911             45 "\\N" 
## # … with 4 more variables: averageRating <dbl>, numVotes <dbl>, Avg_age <dbl>,
## #   ratio_actor_actress <dbl>

2.3 Should we analyze movies and tvSeries together in the same dataset?

As stated earlier, we decided to keep only the titles that are either movies or tvSeries, which are also the title types with the largest number of rows. But is it relevant to analyze movies and series as a whole? For example, the runtimeMinutes variable, which captures the duration of a title, will be very different for tvSeries and for movies, so its effect on the average rating is likely to differ too. If tvSeries have very different characteristics from movies, it might not be a good idea to analyze them together. Let us check this assumption using three different tools:

  1. Boxplots
  2. Standard deviation
  3. Overall distribution

We will compute these for the averageRating and runtimeMinutes variables.
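The summary tables that follow report the minimum, maximum, mean, median, IQR, quartiles and standard deviation per title type. A minimal base R sketch of such a summary, on hypothetical toy ratings and with a helper name (`summarise_by_type`) chosen for illustration, could look like this:

```r
# Toy ratings (hypothetical values), five titles per type.
titles <- data.frame(
  titleType     = c(rep("movie", 5), rep("tvSeries", 5)),
  averageRating = c(5.5, 6.0, 6.2, 6.8, 7.0, 6.6, 7.0, 7.3, 7.8, 8.0),
  stringsAsFactors = FALSE
)

# Descriptive statistics of one numeric variable for each title type.
summarise_by_type <- function(data, variable) {
  groups <- split(data[[variable]], data$titleType)
  stats <- lapply(groups, function(x) {
    c(
      min     = min(x),
      max     = max(x),
      average = mean(x),
      median  = median(x),
      iqr     = IQR(x),
      q1      = unname(quantile(x, 0.25)),
      q3      = unname(quantile(x, 0.75)),
      stdev   = sd(x)
    )
  })
  # One row per title type, one column per statistic.
  do.call(rbind, stats)
}

rating_summary <- summarise_by_type(titles, "averageRating")
```

The same helper can be called with "runtimeMinutes" to produce the second table, provided that column is numeric.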

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 9
##   titleType   min   max average median   iqr    q1    q3 stdev
##   <chr>     <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 movie       1     9.6    6.10    6.2  1.30   5.5   6.8  1.03
## 2 tvSeries    1.6   9.4    7.10    7.3  1.2    6.6   7.8  1.08

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 9
##   titleType   min   max average median   iqr    q1    q3 stdev
##   <chr>     <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 movie         6  1500    92.6     90    21    80   101  26.0
## 2 tvSeries      1  2925    54.0     30    30    30    60  95.9

2.3.1 Average ratings

The average ratings are somewhat different for movies and tvSeries: they are one point higher on average for tvSeries (7.10 against 6.10), but the two types have approximately the same IQR (the middle 50% of the ratings span 1.3 points for movies and 1.2 for series). The standard deviations are also close: 1.03 for movies and 1.08 for series, whose ratings are slightly more spread around their mean.

2.3.2 runtimeMinutes

For the duration of titles, the conclusions are more straightforward: the disparities are obvious. The boxplots are completely different. For movies, most of the points are close to 100 minutes, with an average of 92.6 minutes: 50% of the movies have a duration between 80 and 101 minutes (IQR of 21). As expected, tvSeries have a significantly lower average duration (54 minutes), yet both their IQR and their standard deviation are larger: 50% of the tvSeries have a duration between 30 and 60 minutes, and the data points are much more spread out than for movies, as shown by the standard deviation, which is almost 4 times higher for tvSeries (96 minutes) than for movies (26 minutes).

If we compare the distributions of the two title types using histograms, we can see that their shapes also differ for our dependent variable (averageRating): the distribution for movies is close to a symmetrical normal distribution, while the one for tvSeries is skewed to the left.

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

For all these reasons, it seems more appropriate to split the dataset and analyze these two title types separately.

## [1] "Number of rows in Movies DF\n"
## [1] 55497
## [1] "Number of rows in Series DF\n"
## [1] 5232
## [1] TRUE

2.4 Outlier exclusion

From the boxplots computed above, we noticed that some points were extremely high or extremely low. This might be recurrent in our dataset. To limit the effect that these outliers may have on our models, we decided to write our own outlier exclusion function:

  1. It takes a dataset and a numerical variable from this dataset as input.
  2. It calculates the first and third quartiles (Q1 and Q3).
  3. It calculates the interquartile range (IQR = Q3 - Q1).
  4. It computes the bounds below and above which the data are considered outliers (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR).
  5. It keeps the data that lie between these two bounds.

The function is far from perfect (for example, it doesn’t perform any error handling), but it serves our purpose well later on.
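The five steps above can be sketched as follows. This is an illustrative reimplementation, not the function used in the report: the name `exclude_outliers` is hypothetical, and dropping rows with a missing value in the chosen variable is an added assumption.

```r
# Keep the rows of `data` whose `variable` lies within the 1.5 x IQR fences.
exclude_outliers <- function(data, variable) {
  values <- data[[variable]]

  # Steps 2 and 3: quartiles and interquartile range.
  q1  <- quantile(values, 0.25, na.rm = TRUE)
  q3  <- quantile(values, 0.75, na.rm = TRUE)
  iqr <- q3 - q1

  # Step 4: lower and upper fences.
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr

  # Step 5: keep the rows inside the fences (assumption: NA rows are dropped).
  keep <- !is.na(values) & values >= lower & values <= upper
  data[keep, , drop = FALSE]
}

# Usage on toy runtimes (hypothetical values): the 1500-minute title is removed.
movies  <- data.frame(runtimeMinutes = c(80, 85, 90, 95, 101, 1500))
trimmed <- exclude_outliers(movies, "runtimeMinutes")
```

Applying the function to several variables in sequence, as done later in the report, recomputes the fences on an already trimmed dataset each time, so the order of the calls can change which rows are removed.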

3 Linear regression modelling

As stated before, our objective with the linear regression is to find the variables that help predict the average rating of a title in the IMDb database. Our initial model looks like this:

Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5 + β6 X6 + ε

Y = averageRating: which variables explain the rating of a given title? (The rating serves as an indicator of the overall quality of a title, or of how much it is appreciated.)

X1 = numVotes: What is the effect of a title’s popularity (let’s assume that the more votes a title gets, the more popular it is) on its rating? Are the most popular titles also the best ones?

X2 = runtimeMinutes: What is the effect of a title’s duration on its quality (rating)? Does longer mean better or more appreciated?

X3 = ratio_actor_actress: What effect does having more or fewer male or female actors have on the quality/appreciation of a movie? Are movies with a perfectly balanced actor/actress ratio the best?

X4 = startYear: Does time have an effect on a movie’s rating? Are movies becoming better or worse?

X5 = Avg_age: What effect does the average actor age have on a title’s rating?

X6 = isAdult: What is the effect of a title being adult or non-adult on its rating? Are movies targeted at adults more appreciated than those for the general public?

When formulating these hypotheses, we assumed that there is a relationship between these variables and the rating. Let us now investigate this assumption. Before diving into our regression model, let’s first compute the correlation coefficients between our variables.
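The output below comes from psych::corr.test with Pearson correlations and Holm-adjusted p-values. As a hedged illustration of what that computes, the sketch below reproduces the two ingredients with base R only (`cor`, `cor.test` and `p.adjust`) on simulated data; the toy data frame `num_df` and its values are assumptions.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for three of the numeric columns (hypothetical values).
runtimeMinutes <- rnorm(n, mean = 90, sd = 20)
averageRating  <- 6 + 0.01 * runtimeMinutes + rnorm(n)
numVotes       <- rexp(n, rate = 1 / 1000)
num_df <- data.frame(averageRating, runtimeMinutes, numVotes)

# Pearson correlation matrix on pairwise complete observations.
corr_matrix <- cor(num_df, use = "pairwise.complete.obs", method = "pearson")

# One test per pair of variables, then a Holm correction for multiple testing.
var_pairs <- combn(names(num_df), 2)
p_raw <- apply(var_pairs, 2, function(p) {
  cor.test(num_df[[p[1]]], num_df[[p[2]]])$p.value
})
p_holm <- p.adjust(p_raw, method = "holm")
names(p_holm) <- paste(var_pairs[1, ], var_pairs[2, ], sep = " ~ ")
```

With tens of thousands of rows, as in the movies dataset, even very small correlations come out as statistically significant, so the size of the coefficient matters more than its p-value.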

3.1 Correlation coefficients

## [1] "startYear"           "runtimeMinutes"      "averageRating"      
## [4] "numVotes"            "Avg_age"             "ratio_actor_actress"
## Call:psych::corr.test(x = final_df_movies_num, y = NULL, use = "pairwise", 
##     method = "pearson", adjust = "holm", alpha = 0.05)
## Correlation matrix 
##                     startYear runtimeMinutes averageRating numVotes Avg_age
## startYear                1.00           0.28         -0.07     0.08    0.44
## runtimeMinutes           0.28           1.00          0.18     0.08    0.09
## averageRating           -0.07           0.18          1.00     0.11   -0.04
## numVotes                 0.08           0.08          0.11     1.00    0.05
## Avg_age                  0.44           0.09         -0.04     0.05    1.00
## ratio_actor_actress     -0.20          -0.07          0.00     0.00    0.04
##                     ratio_actor_actress
## startYear                         -0.20
## runtimeMinutes                    -0.07
## averageRating                      0.00
## numVotes                           0.00
## Avg_age                            0.04
## ratio_actor_actress                1.00
## Sample Size 
## [1] 55497
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##                     startYear runtimeMinutes averageRating numVotes Avg_age
## startYear                   0              0           0.0        0       0
## runtimeMinutes              0              0           0.0        0       0
## averageRating               0              0           0.0        0       0
## numVotes                    0              0           0.0        0       0
## Avg_age                     0              0           0.0        0       0
## ratio_actor_actress         0              0           0.7        1       0
##                     ratio_actor_actress
## startYear                             0
## runtimeMinutes                        0
## averageRating                         1
## numVotes                              1
## Avg_age                               0
## ratio_actor_actress                   0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
## [1] "startYear"           "runtimeMinutes"      "averageRating"      
## [4] "numVotes"            "Avg_age"             "ratio_actor_actress"
## Call:psych::corr.test(x = final_df_series_num, y = NULL, use = "pairwise", 
##     method = "pearson", adjust = "holm", alpha = 0.05)
## Correlation matrix 
##                     startYear runtimeMinutes averageRating numVotes Avg_age
## startYear                1.00           0.08         -0.26     0.08    0.68
## runtimeMinutes           0.08           1.00          0.01    -0.01    0.06
## averageRating           -0.26           0.01          1.00     0.08   -0.17
## numVotes                 0.08          -0.01          0.08     1.00    0.05
## Avg_age                  0.68           0.06         -0.17     0.05    1.00
## ratio_actor_actress     -0.34           0.01          0.13    -0.02   -0.15
##                     ratio_actor_actress
## startYear                         -0.34
## runtimeMinutes                     0.01
## averageRating                      0.13
## numVotes                          -0.02
## Avg_age                           -0.15
## ratio_actor_actress                1.00
## Sample Size 
## [1] 5232
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##                     startYear runtimeMinutes averageRating numVotes Avg_age
## startYear                   0           0.00          0.00     0.00       0
## runtimeMinutes              0           0.00          0.97     0.97       0
## averageRating               0           0.47          0.00     0.00       0
## numVotes                    0           0.32          0.00     0.00       0
## Avg_age                     0           0.00          0.00     0.00       0
## ratio_actor_actress         0           0.34          0.00     0.13       0
##                     ratio_actor_actress
## startYear                          0.00
## runtimeMinutes                     0.97
## averageRating                      0.00
## numVotes                           0.51
## Avg_age                            0.00
## ratio_actor_actress                0.00
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

These correlation coefficients are not very satisfying. For movies, the most interesting correlations with the averageRating column are with startYear (-0.07), runtimeMinutes (0.18) and numVotes (0.11).

For tvSeries: startYear (-0.26), numVotes (0.08) and ratio_actor_actress (0.13).

Let’s further investigate how our dependent variable behaves with the others by plotting a few scatterplots.

3.2 Scatterplots

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

It is hard to interpret these plots given the multitude of data points on them. To make them easier to read, we decided to plot a random sample of each dataset.
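A minimal sketch of such a sampling step is shown below. The sample size, the seed and the toy data frame are assumptions; `base::sample` is called explicitly because the mosaic package attached at the top of this report masks `sample`.

```r
set.seed(42)  # fixed seed so that the same sample is drawn on every run

# Toy stand-in for the movies dataset (hypothetical values).
movies <- data.frame(
  id            = seq_len(10000),
  averageRating = round(runif(10000, min = 1, max = 10), 1)
)

sample_size <- 1000  # assumed size, small enough for readable scatterplots

# Draw row indices without replacement and keep the corresponding rows.
sampled_rows  <- base::sample(nrow(movies), sample_size)
movies_sample <- movies[sampled_rows, ]
```

The sample is only used for plotting; the regression models are still estimated on the full datasets.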

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'


We see that in some cases the fitted line is essentially flat: there appears to be no linear relationship between the dependent variable and the explanatory one. For instance, the average rating seems unresponsive to the Avg_age variable for both movies and tvSeries.

For other variables, there appears to be a relationship, but it is not linear. There also seem to be outliers, which we might need to take care of later.

First, let’s try to compute the regression model for each title type assuming a linear model.
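For readers who want to reproduce this kind of fit, here is a small sketch of a multiple linear regression with `lm` on simulated data. The data frame `toy` and the coefficients used to simulate it are assumptions chosen for illustration; the formula only mirrors the structure of the models reported below.

```r
set.seed(123)
n <- 500

# Simulated predictors (hypothetical values).
toy <- data.frame(
  numVotes       = rexp(n, rate = 1 / 1000),
  runtimeMinutes = rnorm(n, mean = 90, sd = 20),
  startYear      = sample(1950:2020, n, replace = TRUE)
)

# Simulated response with a small positive runtime effect and a small
# negative year effect, plus noise.
toy$averageRating <- 15 + 0.008 * toy$runtimeMinutes -
  0.005 * toy$startYear + rnorm(n)

# Ordinary least squares fit.
fit <- lm(averageRating ~ numVotes + runtimeMinutes + startYear, data = toy)

# Coefficient table (estimate, standard error, t value, p-value) and R squared.
coefs     <- coef(summary(fit))
r_squared <- summary(fit)$r.squared
```

Calling `summary(fit)` prints the same kind of table as the outputs shown in the next two subsections.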

3.2.1 Movies

## 
## Call:
## lm(formula = averageRating ~ numVotes + runtimeMinutes + Avg_age + 
##     startYear + ratio_actor_actress + isAdult, data = final_df_movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0083  -0.5317   0.1203   0.6767   3.8955 
## 
## Coefficients:
##                          Estimate    Std. Error t value             Pr(>|t|)
## (Intercept)         17.9785178937  0.4848558782  37.080 < 0.0000000000000002
## numVotes             0.0000044788  0.0000001765  25.371 < 0.0000000000000002
## runtimeMinutes       0.0079832205  0.0001719561  46.426 < 0.0000000000000002
## Avg_age             -0.0013253211  0.0005475872  -2.420             0.015511
## startYear           -0.0064028079  0.0002534877 -25.259 < 0.0000000000000002
## ratio_actor_actress -0.0137644954  0.0041013032  -3.356             0.000791
## isAdult1            -0.2345235001  0.0345050894  -6.797      0.0000000000108
##                        
## (Intercept)         ***
## numVotes            ***
## runtimeMinutes      ***
## Avg_age             *  
## startYear           ***
## ratio_actor_actress ***
## isAdult1            ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.003 on 55490 degrees of freedom
## Multiple R-squared:  0.058,  Adjusted R-squared:  0.0579 
## F-statistic: 569.4 on 6 and 55490 DF,  p-value: < 0.00000000000000022
## 
## Call:
## lm(formula = averageRating ~ numVotes + runtimeMinutes + startYear + 
##     ratio_actor_actress + isAdult, data = final_df_movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0187  -0.5309   0.1205   0.6774   3.8872 
## 
## Coefficients:
##                          Estimate    Std. Error t value             Pr(>|t|)
## (Intercept)         18.4828632484  0.4378115437  42.216 < 0.0000000000000002
## numVotes             0.0000044733  0.0000001765  25.341 < 0.0000000000000002
## runtimeMinutes       0.0080051272  0.0001717252  46.616 < 0.0000000000000002
## startYear           -0.0066863945  0.0002247922 -29.745 < 0.0000000000000002
## ratio_actor_actress -0.0150598247  0.0040664113  -3.703             0.000213
## isAdult1            -0.2195589460  0.0339480952  -6.467         0.0000000001
##                        
## (Intercept)         ***
## numVotes            ***
## runtimeMinutes      ***
## startYear           ***
## ratio_actor_actress ***
## isAdult1            ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.003 on 55491 degrees of freedom
## Multiple R-squared:  0.0579, Adjusted R-squared:  0.05781 
## F-statistic: 682.1 on 5 and 55491 DF,  p-value: < 0.00000000000000022

3.2.2 Series

## 
## Call:
## lm(formula = averageRating ~ numVotes + runtimeMinutes + Avg_age + 
##     startYear + ratio_actor_actress, data = final_df_series)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4553 -0.4956  0.1319  0.6839  2.7205 
## 
## Coefficients:
##                         Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)         44.187913710  2.751233038  16.061 < 0.0000000000000002 ***
## numVotes             0.000013273  0.000001655   8.018  0.00000000000000131 ***
## runtimeMinutes       0.000351889  0.000149741   2.350              0.01881 *  
## Avg_age             -0.000127507  0.001887532  -0.068              0.94614    
## startYear           -0.018777105  0.001416106 -13.260 < 0.0000000000000002 ***
## ratio_actor_actress  0.022009415  0.007899572   2.786              0.00535 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.035 on 5226 degrees of freedom
## Multiple R-squared:  0.08079,    Adjusted R-squared:  0.07991 
## F-statistic: 91.86 on 5 and 5226 DF,  p-value: < 0.00000000000000022
## 
## Call:
## lm(formula = averageRating ~ numVotes + startYear + ratio_actor_actress + 
##     runtimeMinutes, data = final_df_series)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4545 -0.4961  0.1323  0.6844  2.7207 
## 
## Coefficients:
##                         Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)         44.310002826  2.074132414  21.363 < 0.0000000000000002 ***
## numVotes             0.000013275  0.000001655   8.021  0.00000000000000129 ***
## startYear           -0.018841644  0.001045165 -18.027 < 0.0000000000000002 ***
## ratio_actor_actress  0.021945557  0.007842060   2.798              0.00515 ** 
## runtimeMinutes       0.000351880  0.000149726   2.350              0.01880 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.034 on 5227 degrees of freedom
## Multiple R-squared:  0.08079,    Adjusted R-squared:  0.08009 
## F-statistic: 114.9 on 4 and 5227 DF,  p-value: < 0.00000000000000022

The R² is very low here: the share of the variance in the dependent variable (averageRating) that the independent variables explain collectively is below 10% for both datasets (5.8% for movies and 8% for series). These low R² values are not really surprising given the weak correlation coefficients computed above: every variable had a low correlation with averageRating (at most 0.18 in absolute value for movies and 0.26 for series). On the other hand, the p-values are highly significant for almost all the predictors. We tried to improve the model with a backward stepwise step: remove the least significant predictor (Avg_age in both cases) and re-estimate the model. As the second output of each pair shows, this leaves the R² essentially unchanged.
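One step of this backward elimination can be sketched as below, on simulated data in which the predictor called `noise` is unrelated to the response by construction. The variable names and coefficients are hypothetical.

```r
set.seed(7)
n <- 300

# Two informative predictors and one pure-noise predictor (hypothetical data).
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + noise, data = toy)

# p-values of the predictors (the intercept in row 1 is left out).
p_values <- coef(summary(full))[-1, "Pr(>|t|)"]

# The least significant predictor is the one with the largest p-value.
weakest <- names(which.max(p_values))

# Re-estimate the model without that predictor.
reduced <- update(full, as.formula(paste(". ~ . -", weakest)))

# Compare the adjusted R squared of the two models.
adj_r2 <- c(
  full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared
)
```

Dropping a predictor can never increase the plain R², which is why the adjusted R² is the more informative figure when comparing the two models.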

We notice on the scatterplots that the relationship between our dependent variable and the explanatory ones is not really linear.

When we look at the residual plots for each dataset (which contain more points than the scatterplots above, because the scatterplots use a sample of the data), we notice a non-linear shape where the residuals are close to zero, whereas the residuals should be scattered randomly around zero. We also notice some outliers on the plot, which might bias the regression line. Let’s compute a few boxplots to identify the variables responsible for these outliers and get rid of them.

Let’s first check for the presence of outliers in our dataset using boxplots of our quantitative variables. These variables are the following:

## [1] "startYear"           "runtimeMinutes"      "averageRating"      
## [4] "numVotes"            "Avg_age"             "ratio_actor_actress"

## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

Let’s remove the outliers for the variables that are already in our regression model.

For movies:

  • averageRating
  • numVotes
  • runtimeMinutes
  • startYear
  • ratio_actor_actress

For series:

  • averageRating
  • numVotes
  • runtimeMinutes
  • startYear
  • ratio_actor_actress

## [1] "number of rows before exclusion for movies"
## [1] 55497
## [1] "number of rows before exclusion for tvSeries"
## [1] 5232
## [1] "number of rows after exclusion for movies"
## [1] 43416
## [1] "number of rows after exclusion for tvSeries"
## [1] 3992

Now the residuals are distributed symmetrically around zero, clustered towards the middle of the plot at low values of the y-axis, and show no clear pattern. They appear randomly distributed (so there is no sign of heteroscedasticity either).

The downside of excluding these outliers is that our already low R² is now even lower: it has lost about 1.5 percentage points on average for each titleType. As the most extreme data points were removed (a loss of data), the share of the variation in our data explained by the model has also decreased.

In fact, when we removed the outliers (which aren’t erroneous, simply unusual observations), we also removed part of the already weak relationship between the variables. We could not expect our R² (which measures the extent to which our explanatory variables explain the variance of the dependent one) to be higher with variables that are less correlated.

## Call:psych::corr.test(x = final_df_movies_num, y = NULL, use = "pairwise", 
##     method = "pearson", adjust = "holm", alpha = 0.05)
## Correlation matrix 
##                     startYear runtimeMinutes averageRating numVotes Avg_age
## startYear                1.00           0.28         -0.07     0.08    0.44
## runtimeMinutes           0.28           1.00          0.18     0.08    0.09
## averageRating           -0.07           0.18          1.00     0.11   -0.04
## numVotes                 0.08           0.08          0.11     1.00    0.05
## Avg_age                  0.44           0.09         -0.04     0.05    1.00
## ratio_actor_actress     -0.20          -0.07          0.00     0.00    0.04
##                     ratio_actor_actress
## startYear                         -0.20
## runtimeMinutes                    -0.07
## averageRating                      0.00
## numVotes                           0.00
## Avg_age                            0.04
## ratio_actor_actress                1.00
## Sample Size 
## [1] 55497
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##                     startYear runtimeMinutes averageRating numVotes Avg_age
## startYear                   0              0           0.0        0       0
## runtimeMinutes              0              0           0.0        0       0
## averageRating               0              0           0.0        0       0
## numVotes                    0              0           0.0        0       0
## Avg_age                     0              0           0.0        0       0
## ratio_actor_actress         0              0           0.7        1       0
##                     ratio_actor_actress
## startYear                             0
## runtimeMinutes                        0
## averageRating                         1
## numVotes                              1
## Avg_age                               0
## ratio_actor_actress                   0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
## 
## Call:
## lm(formula = averageRating ~ numVotes + runtimeMinutes + startYear + 
##     ratio_actor_actress + isAdult, data = final_df_movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.96380 -0.54677  0.05865  0.60504  2.96241 
## 
## Coefficients:
##                        Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)         21.45031482  0.46769851  45.864 < 0.0000000000000002 ***
## numVotes             0.00026436  0.00002684   9.850 < 0.0000000000000002 ***
## runtimeMinutes       0.00932437  0.00030216  30.859 < 0.0000000000000002 ***
## startYear           -0.00828274  0.00024391 -33.958 < 0.0000000000000002 ***
## ratio_actor_actress -0.01547128  0.00401473  -3.854             0.000117 ***
## isAdult1            -0.09215353  0.03072053  -3.000             0.002704 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.876 on 43410 degrees of freedom
## Multiple R-squared:  0.03876,    Adjusted R-squared:  0.03865 
## F-statistic: 350.1 on 5 and 43410 DF,  p-value: < 0.00000000000000022
## Call:psych::corr.test(x = final_df_series_num, y = NULL, use = "pairwise", 
##     method = "pearson", adjust = "holm", alpha = 0.05)
## Correlation matrix 
##                     startYear runtimeMinutes averageRating numVotes Avg_age
## startYear                1.00           0.08         -0.26     0.08    0.68
## runtimeMinutes           0.08           1.00          0.01    -0.01    0.06
## averageRating           -0.26           0.01          1.00     0.08   -0.17
## numVotes                 0.08          -0.01          0.08     1.00    0.05
## Avg_age                  0.68           0.06         -0.17     0.05    1.00
## ratio_actor_actress     -0.34           0.01          0.13    -0.02   -0.15
##                     ratio_actor_actress
## startYear                         -0.34
## runtimeMinutes                     0.01
## averageRating                      0.13
## numVotes                          -0.02
## Avg_age                           -0.15
## ratio_actor_actress                1.00
## Sample Size 
## [1] 5232
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##                     startYear runtimeMinutes averageRating numVotes Avg_age
## startYear                   0           0.00          0.00     0.00       0
## runtimeMinutes              0           0.00          0.97     0.97       0
## averageRating               0           0.47          0.00     0.00       0
## numVotes                    0           0.32          0.00     0.00       0
## Avg_age                     0           0.00          0.00     0.00       0
## ratio_actor_actress         0           0.34          0.00     0.13       0
##                     ratio_actor_actress
## startYear                          0.00
## runtimeMinutes                     0.97
## averageRating                      0.00
## numVotes                           0.51
## Avg_age                            0.00
## ratio_actor_actress                0.00
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
## 
## Call:
## lm(formula = averageRating ~ numVotes + startYear + ratio_actor_actress + 
##     runtimeMinutes, data = final_df_series)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.65179 -0.54446  0.06101  0.59694  2.50035 
## 
## Coefficients:
##                       Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)         39.0166200  2.0462939  19.067 < 0.0000000000000002 ***
## numVotes             0.0004523  0.0001406   3.216              0.00131 ** 
## startYear           -0.0161758  0.0010354 -15.622 < 0.0000000000000002 ***
## ratio_actor_actress  0.0037084  0.0073438   0.505              0.61360    
## runtimeMinutes       0.0035265  0.0007850   4.492           0.00000725 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8423 on 3987 degrees of freedom
## Multiple R-squared:  0.0678, Adjusted R-squared:  0.06687 
## F-statistic:  72.5 on 4 and 3987 DF,  p-value: < 0.00000000000000022

For a better R squared, we could reload the datasets we had before the outlier exclusion, but we would face the same problem as earlier: non-linear relationships between some of our independent variables and the dependent variable (as depicted in the scatterplots we drew before). Indeed, some scatterplots show curves whose slope changes along the x-axis. In order to use the linear model for this type of non-linear relationship, we need to linearize it by applying a logarithmic transformation to those variables.

## [1] "Number of rows in Movies DF\n"
## [1] 55497
## [1] "Number of rows in Series DF\n"
## [1] 5232
## 
## Call:
## lm(formula = averageRating ~ log(numVotes) + runtimeMinutes + 
##     log(startYear) + ratio_actor_actress, data = final_df_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7484 -0.5227  0.0913  0.6424  4.2216 
## 
## Coefficients:
##                        Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)         132.6612984   3.2774715  40.477 < 0.0000000000000002 ***
## log(numVotes)         0.1177875   0.0021013  56.054 < 0.0000000000000002 ***
## runtimeMinutes        0.0071212   0.0001682  42.348 < 0.0000000000000002 ***
## log(startYear)      -16.8463275   0.4328546 -38.919 < 0.0000000000000002 ***
## ratio_actor_actress  -0.0271927   0.0039845  -6.825     0.00000000000891 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9819 on 55492 degrees of freedom
## Multiple R-squared:  0.09728,    Adjusted R-squared:  0.09722 
## F-statistic:  1495 on 4 and 55492 DF,  p-value: < 0.00000000000000022
## 
## Call:
## lm(formula = averageRating ~ log(numVotes) + log(startYear) + 
##     runtimeMinutes, data = final_df_series)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2304 -0.5261  0.1109  0.6739  3.0043 
## 
## Coefficients:
##                   Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)    322.4967903  14.6480424  22.016 < 0.0000000000000002 ***
## log(numVotes)    0.1010543   0.0072419  13.954 < 0.0000000000000002 ***
## log(startYear) -41.6106166   1.9304251 -21.555 < 0.0000000000000002 ***
## runtimeMinutes   0.0004078   0.0001479   2.757              0.00586 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.023 on 5228 degrees of freedom
## Multiple R-squared:  0.1014, Adjusted R-squared:  0.1009 
## F-statistic: 196.7 on 3 and 5228 DF,  p-value: < 0.00000000000000022

We performed a logarithmic transformation on the variables which, as depicted on the scatterplots, have a non-linear relationship with the dependent variable: numVotes and startYear. As the model calls above show, runtimeMinutes enters both models untransformed.

The logarithmic transformation improved both models by roughly two percentage points: the R squared went from 7.9% to 9.7% for movies, and from 8% to 10.1% for tvSeries.
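
To illustrate why a logarithmic transformation can raise R squared, here is a minimal sketch on simulated data. The values are made up (they are not the IMDb data), and the assumed data-generating process, a rating that grows with the logarithm of the vote count, is chosen purely for illustration.

```r
# Simulated toy data: heavy-tailed vote counts, rating linear in log(votes)
# (assumption: illustrative values only)
set.seed(123)
n <- 2000
toy <- data.frame(numVotes = round(rlnorm(n, meanlog = 6, sdlog = 2)) + 1)
toy$averageRating <- 5 + 0.12 * log(toy$numVotes) + rnorm(n, sd = 0.5)

# Same response, predictor on the raw scale vs on the log scale
fit_lin <- lm(averageRating ~ numVotes, data = toy)
fit_log <- lm(averageRating ~ log(numVotes), data = toy)

r2_lin <- summary(fit_lin)$r.squared
r2_log <- summary(fit_log)$r.squared
print(c(linear = r2_lin, log = r2_log))
```

When the predictor is strongly right-skewed, as vote counts typically are, the raw-scale fit is dominated by a few very large values, while the log-scale fit spreads the observations more evenly.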

3.3 Final interpretation

Our final models are the following ones:

3.3.1 Movies

Y = b0 + b1X1 + b2X2 + b3X3 + b4X4

averageRating = b0 + b1 * log(numVotes) + b2 * runtimeMinutes + b3 * log(startYear) + b4 * ratio_actor_actress

The variables of this model explain 9.7% of the variation of the average rating variable.

  • For a 1% increase in the number of votes of a movie title, the mean of the average rating changes by about 0.0012 (0.118 × log(1.01)), with everything else held constant; doubling the number of votes corresponds to a change of about 0.08

  • For a change of one minute in a movie title’s duration, the mean of the average rating changes by about 0.007

  • For a 1% increase in a movie title’s start year, the mean of the average rating changes by about -0.17; around the year 2000, this corresponds to roughly -0.008 per additional year

  • For a change of one unit in the actor/actress ratio, the mean of the average rating changes by about -0.03
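
Because numVotes and startYear enter the model as logarithms, their coefficients are most naturally read as effects of relative changes: multiplying a predictor by a factor c shifts the predicted rating by b × log(c). The short sketch below works this out with the coefficients reported in the movies model output; the base year 2000 is an assumption chosen only for illustration.

```r
# Coefficients copied from the movies model output above
b_votes <- 0.1177875    # coefficient of log(numVotes)
b_year  <- -16.8463275  # coefficient of log(startYear)

# Effect of a 1% increase in the number of votes
effect_1pct_votes <- b_votes * log(1.01)

# Effect of doubling the number of votes
effect_double_votes <- b_votes * log(2)

# Effect of one additional year, evaluated around the year 2000
# (assumption: 2000 is an illustrative reference year)
effect_one_year <- b_year * log(2001 / 2000)

print(round(c(one_pct_votes = effect_1pct_votes,
              double_votes  = effect_double_votes,
              one_year      = effect_one_year), 4))
```

The per-year effect obtained this way is close to the startYear coefficient of the untransformed model shown earlier (about -0.008), which is a useful sanity check.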

3.3.2 Series

Y = b0 + b1X1 + b2X2 + b3X3

averageRating = b0 + b1 * log(numVotes) + b2 * log(startYear) + b3 * runtimeMinutes

The variables of this model explain 10.1% of the variation of the average rating variable.

  • For a 1% increase in the number of votes of a series title, the mean of the average rating changes by about 0.0010 (0.101 × log(1.01)), with everything else held constant

  • For a 1% increase in a series title’s start year, the mean of the average rating changes by about -0.41; around the year 2000, this corresponds to roughly -0.02 per additional year

  • For a change of one minute in a series title’s duration, the mean of the average rating changes by about 0.0004

As mentioned before, we end up with a relatively low r^2. A low r^2 doesn’t necessarily mean that our model is ultimately bad or that the dataset is of bad quality: maybe, in addition to the variables that we have identified, we are missing a few others which would help better explain the variation of the average rating. Also, the average rating variable is a measure of people’s preferences, and predicting what people like and dislike is a complex task because people are complex. It is likely that creating a somewhat reliable predictor of people’s appreciation for a movie or tvSeries would require a more complex model with many more variables. An r^2 of about 10 percent might also be enough, as suggested by Falk and Miller (1992), who recommended that r^2 values should be equal to or greater than 0.10 in order for the variance explained of a particular endogenous construct to be deemed adequate (https://www.researchgate.net/post/what_is_a_good_r_square_value_in_regression_analysis/1). By that threshold, the tvSeries model (10.1%) qualifies and the movies model (9.7%) falls just short.